CUDA: remove -sm row, refactor cuBLAS #24216
Sorry, I accidentally pushed the wrong logic for CDNA + BF16. This is the performance with the correct logic: Performance
The CI was failing because I had accidentally copied a return statement when restructuring the code. The code paths on which I tested the performance were unaffected.
6bcdfe5 to fdf64b8
} else if (env_cpp == "f16" || env_cpp == "fp16") {
    compute_type = GGML_TYPE_F16;
} else if (env_cpp == "bf16") {
    compute_type = GGML_TYPE_BF16;
Unlike f16/f32, ggml_get_to_bf16_cuda currently supports only f32 and f16 conversion. So, quantized models will fail with GGML_CUDA_FORCE_CUBLAS and GGML_CUDA_CUBLAS_COMPUTE_TYPE=bf16.
Thank you for bringing up the issue of GGML_CUDA_FORCE_CUBLAS. Looking at the code again, there were related issues beyond just the conversion to BF16; for that, it was enough to add a few more template specializations. For tensors that are contiguously allocated but permuted, this PR would have broken the dequantization; since the layout is homogeneous, this can be fixed by applying the contiguous dequantization kernel. Long-term, GGML_CUDA_FORCE_CUBLAS should be removed and made an environment variable. The reason it is not one already was just to cut down on compilation time, but this no longer provides any real benefit.
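To illustrate what "contiguously allocated but permuted" means here, the following is a minimal sketch and not the code from this PR. `TensorLayout` and its helpers are hypothetical stand-ins for the ggml tensor metadata, and the check assumes a non-quantized element type, every dimension of size at least 1, and no overlapping strides. The idea is that a tensor covers its memory without gaps when the byte span from the first to the last element equals the element count times the element size, regardless of the order of the strides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for tensor metadata (names chosen for illustration only).
struct TensorLayout {
    int64_t ne[4];     // number of elements per dimension
    size_t  nb[4];     // stride in bytes per dimension
    size_t  type_size; // bytes per element (assumes a non-quantized type)
};

// Total number of elements in the tensor.
inline size_t num_elements(const TensorLayout & t) {
    size_t n = 1;
    for (int i = 0; i < 4; ++i) {
        n *= static_cast<size_t>(t.ne[i]);
    }
    return n;
}

// Bytes from the start of the first element to the end of the last one.
inline size_t spanned_bytes(const TensorLayout & t) {
    size_t span = t.type_size;
    for (int i = 0; i < 4; ++i) {
        span += static_cast<size_t>(t.ne[i] - 1) * t.nb[i];
    }
    return span;
}

// True if the elements tile their memory range without gaps, in any stride order.
inline bool is_contiguously_allocated(const TensorLayout & t) {
    return spanned_bytes(t) == num_elements(t) * t.type_size;
}
```

Under these assumptions, a 2x3 float tensor and its transposed (permuted) view both pass the check, while a strided view that skips elements does not. Because every element of a gap-free buffer can be processed independently of its logical position, a kernel written for contiguous data can walk the buffer linearly in that case.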
It seems the HIP build is failing now due to the bf16 conversion changes. Apart from this, the PR looks good to me.
This PR removes CUDA backend support for split buffers (--split-mode row); by now -sm tensor has all of the necessary features to make it obsolete. Split buffers therefore do not need to be considered for #23935. Also, it is possible to remove a lot of legacy code that predates the ggml backend API (ggml_cuda_op_mul_mat). I refactored and deduplicated the cuBLAS code to use only a single function for both batched and non-batched GEMM. The compute type is chosen based on speed and can be overridden with GGML_CUDA_CUBLAS_COMPUTE_TYPE. I did some A/B testing for cuBLAS configuration which unlocked some FP16/BF16 performance for some GPUs.
Performance
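As a rough sketch of how such an environment override could be parsed, the snippet below maps the value of GGML_CUDA_CUBLAS_COMPUTE_TYPE to a compute type. It is not the code from this PR: `ComputeType`, `parse_compute_type`, and `compute_type_override` are hypothetical names, and only the "f16"/"fp16" and "bf16" spellings appear in the reviewed diff; the "f32"/"fp32" spellings and the treatment of unknown values as "no override" are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical stand-in for the relevant ggml type constants.
enum class ComputeType { F32, F16, BF16 };

// Map a user-supplied string to a compute type.
// std::nullopt means "no valid override, fall back to the speed-based default".
inline std::optional<ComputeType> parse_compute_type(const std::string & s) {
    if (s == "f32" || s == "fp32") { // assumed spellings, not shown in the diff
        return ComputeType::F32;
    }
    if (s == "f16" || s == "fp16") {
        return ComputeType::F16;
    }
    if (s == "bf16") {
        return ComputeType::BF16;
    }
    return std::nullopt;
}

// Read the override from the environment, if the variable is set.
inline std::optional<ComputeType> compute_type_override() {
    const char * env = std::getenv("GGML_CUDA_CUBLAS_COMPUTE_TYPE");
    if (env == nullptr) {
        return std::nullopt;
    }
    return parse_compute_type(env);
}
```

A caller would use the returned value when present and otherwise keep whichever compute type the backend selects for speed on the current GPU.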
Requirements